Big Data Analysis of Gas Prices using Apache Spark
This project analyzes 3 years of gas price data (2022-2024) from over 14,000 gas stations across France using Apache Spark. We build a complete data pipeline: from data collection to price forecasting.
| File | Description | Size |
|---|---|---|
| Prix2022S1.csv | Prices Jan-Jun 2022 | ~178MB |
| Prix2022S2.csv | Prices Jul-Dec 2022 | ~156MB |
| Prix2023.csv | Prices 2023 | ~322MB |
| Prix2024.csv | Prices 2024 | ~310MB |
| Stations2024.csv | Station locations | ~2.6K |
| Services2024.csv | Station services | ~980K |
Total price records: 14,214,837
Source: French Government Open Data via GitHub
Key Observations:
| Gas Type | Map |
|---|---|
| Gazole | View Map |
| SP98 | View Map |
| E10 | View Map |
| E85 | View Map |
[Figure: map previews for Gazole, SP98, E10 and E85]
Price Index Interpretation:
- = 1.0 -> Station price equals the national average
- > 1.0 -> More expensive than average (red)
- < 1.0 -> Cheaper than average (green)

E85 and E10 are particularly expensive in Corse (Corsica).

We built models to predict next-day gas prices using lag features (past prices as inputs).
The predictions below use Gazole prices.
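The lag features mentioned above can be sketched in plain Python. This is only an illustration, not the notebook's code: the function name `build_lag_features`, the input layout (one list of daily prices for a single station, oldest first) and the choice to compute the rolling average from past days only are assumptions. The feature names follow the feature-importance list in this README.

```python
from statistics import mean


def build_lag_features(prices, lags=(1, 2, 3, 7), window=7):
    """Build one feature row per day that has enough history.

    prices: daily prices for one station and one fuel, oldest first
            (assumed layout, not taken from the notebook).
    """
    history = max(max(lags), window)  # days of history needed before the first row
    rows = []
    for d in range(history, len(prices)):
        # price_lag_k is the price k days before the target day d
        row = {f"price_lag_{k}": prices[d - k] for k in lags}
        # Rolling mean over the previous `window` days (target day excluded)
        row[f"price_rolling_{window}d"] = mean(prices[d - window:d])
        row["target"] = prices[d]  # the price the model should predict
        rows.append(row)
    return rows


# Made-up prices, for illustration only
sample = [1.80, 1.81, 1.79, 1.82, 1.83, 1.81, 1.80, 1.84, 1.85]
for row in build_lag_features(sample):
    print(row)
```

In Spark the same idea would typically be expressed with window functions partitioned by station and ordered by date. Excluding the target day from the rolling average keeps the target from leaking into its own features.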
| Model | R2 | RMSE | MAE |
|---|---|---|---|
| Linear Regression | 0.9569 | 0.0163 | 0.0121 |
| Random Forest | 0.9241 | 0.0216 | 0.0152 |
Interpretation:
- Points close to the red diagonal of the dispersion plot = good predictions
- Linear Regression performs slightly better than Random Forest (R2 = 0.9569 vs. 0.9241)
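For readers unfamiliar with the three metrics in the table, here is a minimal sketch of how R2, RMSE and MAE are defined. The function name `regression_metrics` and the sample values are hypothetical; the notebook itself most likely relies on Spark ML's evaluators rather than hand-written code.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute R2, RMSE and MAE for two equally long lists of numbers."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n          # mean absolute error
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    rmse = math.sqrt(ss_res / n)                   # root mean squared error
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot                     # share of variance explained
    return {"r2": r2, "rmse": rmse, "mae": mae}


# Made-up values, for illustration only
print(regression_metrics([1.80, 1.82, 1.85, 1.83], [1.81, 1.82, 1.84, 1.84]))
```

RMSE and MAE are expressed in the same unit as the price, while R2 is unitless, which is why the table can compare models on all three at once.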
Most Important Features:
1. price_lag_1 (35%) - Yesterday’s price is the best predictor
2. price_lag_2 (25%) - Price 2 days ago
3. price_rolling_7d (19%) - 7-day moving average
4. price_lag_3 (14%) - Price 3 days ago
5. price_lag_7 (4%) - Same day last week

Least important features (Random Forest): day_of_week, month, day_of_month (< 2% combined)
Conclusion: Gas prices are highly autocorrelated - recent prices are the best predictors of future prices. Calendar features have minimal impact.
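To make "autocorrelated" concrete, the sketch below computes the standard sample autocorrelation of a series at a given lag. The function name `autocorrelation` and the sample prices are assumptions for illustration; this check is not claimed to be part of the notebook.

```python
def autocorrelation(series, lag=1):
    """Sample autocorrelation of `series` at the given lag (0 <= lag < len(series))."""
    n = len(series)
    mu = sum(series) / n
    # Sum of squared deviations from the mean (the lag-0 autocovariance, unnormalized)
    var = sum((x - mu) ** 2 for x in series)
    # Sum of products of deviations between each value and the value `lag` steps later
    cov = sum((series[i] - mu) * (series[i + lag] - mu) for i in range(n - lag))
    return cov / var


# Made-up daily prices, for illustration only
prices = [1.80, 1.81, 1.82, 1.84, 1.85, 1.86, 1.85, 1.83, 1.82, 1.80]
for k in (1, 2, 3, 7):
    print(k, round(autocorrelation(prices, k), 3))
```

Values close to 1 at short lags mean that yesterday's price carries most of the information about today's price, which matches the feature ranking reported above.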
| Tool | Purpose |
|---|---|
| PySpark | Big data processing (over 14 million records) |
| Spark SQL | Data aggregation and queries |
| Spark ML | Machine learning pipelines |
| Matplotlib/Seaborn | Static visualizations |
| Folium | Interactive geographic maps |
| Pandas | Data manipulation for plotting |
├── data/
│ ├── Prix2022S1.csv
│ ├── Prix2022S2.csv
│ ├── Prix2023.csv
│ ├── Prix2024.csv
│ ├── Stations2024.csv
│ └── Services2024.csv
├── figures/
│ ├── gas_prices_evolution.png # Price evolution chart
│ ├── model_dispersion_plot.png # Model evaluation
│ ├── feature_importance.png # Feature importance
│ └── france_gas_prices_map*.html # Interactive maps
├── gas_consumption_france.ipynb # Main notebook
├── config.yaml # Configuration file
└── README.md
pip install numpy pandas pyyaml pyspark matplotlib seaborn geopandas folium
Java must also be installed (Spark requires it).
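A quick way to confirm that Java is visible before starting Spark is sketched below. The helper name `java_version_line` is hypothetical, and this README does not state which Java version is required.

```python
import shutil
import subprocess


def java_version_line():
    """Return the first line printed by `java -version`, or None if Java is not on PATH."""
    java = shutil.which("java")
    if java is None:
        return None
    result = subprocess.run([java, "-version"], capture_output=True, text=True)
    # `java -version` usually writes to stderr, so check it first
    output = result.stderr or result.stdout
    lines = output.strip().splitlines()
    return lines[0] if lines else None


version = java_version_line()
print(version if version else "Java not found on PATH")
```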
1. Place the data files in the ./data/ directory
2. Run gas_consumption_france.ipynb

Price Index = 100 × (Station Price - National Average) / National Average + 1
Week Index = floor((Current Date - First Date) / 7) + 1
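The two formulas above can be transcribed directly into Python. This is a sketch under assumptions: the function names `price_index` and `week_index` are hypothetical, dates are assumed to be `datetime.date` values, and the factor of 100 is kept exactly as written in the formula.

```python
import math
from datetime import date


def price_index(station_price, national_average):
    """Price Index = 100 x (Station Price - National Average) / National Average + 1."""
    return 100 * (station_price - national_average) / national_average + 1


def week_index(current, first):
    """Week Index = floor((Current Date - First Date) / 7) + 1, counting weeks from 1."""
    return math.floor((current - first).days / 7) + 1


# A station priced exactly at the national average has an index of 1.0
print(price_index(1.80, 1.80))

# The first seven days of the dataset fall in week 1, the eighth day starts week 2
first_day = date(2022, 1, 1)
print(week_index(date(2022, 1, 7), first_day))
print(week_index(date(2022, 1, 8), first_day))
```

With this definition an index above 1.0 marks a station that is more expensive than average and an index below 1.0 a cheaper one, in line with the interpretation given for the maps.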
This project uses open data from the French government.